… using the MGnify API and an interactive map widget
The MGnify API returns JSON data. The jsonapi_client package can help you load this data into Python, e.g. into a Pandas dataframe.
This example shows you how to load a MGnify AtlantECO Super Study's data from the MGnify API and display it on an interactive world map.
You can find all of the other “API endpoints” using the Browsable API interface in your web browser. The URL you see in the browsable API is exactly the same as the one you can use in this code.
This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.
Content:
- Fetch AtlantECO studies data
- Show the studies' samples on the interactive world map
- Check functional annotation terms' presence/absence:
  - GO term
  - InterPro
  - Biosynthetic Gene Clusters (BGC)
A Super Study is a collection of MGnify Studies originating from a major project. AtlantECO is one such project, aiming to develop and apply a novel, unifying framework that provides knowledge-based resources for a better understanding and management of the Atlantic Ocean and its ecosystem services.
Fetch the Super Study’s Studies from the MGnify API, into a Pandas dataframe:
```python
import pandas as pd
from jsonapi_client import Session, Modifier

atlanteco_endpoint = 'super-studies/atlanteco/flagship-studies'

with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    studies = map(lambda r: r.json, mgnify.iterate(atlanteco_endpoint))
    studies = pd.json_normalize(studies)
studies[:5]
```
| | type | id | attributes.samples-count | attributes.accession | attributes.bioproject | attributes.is-private | attributes.last-update | attributes.secondary-accession | attributes.centre-name | attributes.study-abstract | attributes.study-name | attributes.data-origination | relationships.biomes.data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | studies | MGYS00006028 | 991 | MGYS00006028 | PRJNA656268 | False | 2024-05-21T20:40:41 | SRP278138 | Biological Global Ocean Ship-based Hydrographi... | The Global Ocean Shipboard Hydrographic Invest... | Bio-GO-SHIP: Global marine 'omics studies of r... | HARVESTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
| 1 | studies | MGYS00002392 | 1073 | MGYS00002392 | PRJEB6610 | False | 2024-04-15T20:15:45 | ERP006157 | GSC | Analysis of 18S DNA in Tara Oceans Protists si... | Amplicon sequencing of Tara Oceans DNA samples... | SUBMITTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
| 2 | studies | MGYS00006613 | 58 | MGYS00006613 | PRJEB40759 | False | 2024-03-01T18:29:37 | ERP124426 | Ocean Sampling Day Consortium | Ocean Sampling Day was initiated by the EU-fun... | 18S rRNA amplicon sequencing from the Ocean Sa... | SUBMITTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
| 3 | studies | MGYS00006612 | 48 | MGYS00006612 | PRJEB40763 | False | 2024-03-01T18:14:18 | ERP124432 | Ocean Sampling Day Consortium | Ocean Sampling Day was initiated by the EU-fun... | 18S rRNA amplicon sequencing from the Ocean Sa... | SUBMITTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
| 4 | studies | MGYS00006611 | 63 | MGYS00006611 | PRJEB55999 | False | 2024-03-01T18:01:09 | ERP140920 | Ocean Sampling Day Consortium | Ocean Sampling Day was initiated by the EU-fun... | 18S rRNA amplicon sequencing from the Ocean Sa... | SUBMITTED | [{'id': 'root:Environmental:Aquatic:Marine', '... |
Show the studies’ samples on a map
We can fetch the Samples for each Study and concatenate them all into one DataFrame. Each sample has geolocation data in its attributes, which is what we need to build a map.
It takes time to fetch data for all samples, so let's show samples from the chosen study PRJEB46727 only. This study contains assembly data: https://www.ebi.ac.uk/metagenomics/studies/MGYS00005810#overview.
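Here is a minimal sketch of that fetch, assuming the study's samples expose `attributes.latitude` and `attributes.longitude`; the `sample_id`, `study`, `lon` and `lat` column names are our own choice, made to match the `show_on_map` helper defined further down:

```python
import pandas as pd
import leafmap
from jsonapi_client import Session

study_accession = 'MGYS00005810'  # the MGnify accession for PRJEB46727

with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    samples = map(lambda r: r.json, mgnify.iterate(f'studies/{study_accession}/samples'))
    samples = pd.json_normalize(samples)

# Keep only the fields we need; coordinates arrive as strings, so cast them
studies_samples = pd.DataFrame({
    'sample_id': samples['id'],
    'study': study_accession,
    'lon': samples['attributes.longitude'].astype(float),
    'lat': samples['attributes.latitude'].astype(float),
})

# A quick look at the sample locations on a world map
m = leafmap.Map(center=(0, 0), zoom=2)
m.add_points_from_xy(studies_samples, x='lon', y='lat')
m
```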
Let’s check whether a specific identifier is present in each sample.
We will work with the MGnify analyses (MGYAs) corresponding to the chosen samples. We filter analyses by:
- pipeline version: 5.0
- experiment type: assembly
This example shows how to process just the first 10 samples (again, because the full dataset takes a while to fetch). Firstly, get analyses for each sample.
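A sketch of that step, assuming the `studies_samples` dataframe built above and that the MGnify `analyses` endpoint accepts `sample_accession`, `pipeline_version` and `experiment_type` filter parameters via a jsonapi_client `Modifier`:

```python
from jsonapi_client import Modifier

analyses = []
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    # Only the first 10 samples, to keep the fetch time reasonable
    for idx, sample in studies_samples[:10].iterrows():
        print(f"processing {sample.sample_id}")
        # Restrict to pipeline v5.0 assembly analyses of this sample
        filtering = Modifier(
            f"pipeline_version=5.0"
            f"&sample_accession={sample.sample_id}"
            f"&experiment_type=assembly"
        )
        analysis = map(lambda r: r.json, mgnify.iterate('analyses', filtering))
        analyses.append(pd.json_normalize(analysis))
analyses = pd.concat(analyses)
```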
Define two functions:
- `identify_existance` checks each analysis for the identifier's presence or absence. It adds a colour column to the dataframe: blue if the identifier was found, red if not.
- `show_on_map` plots the results on the world map. It joins the analyses and samples tables so the geolocation and identifier-presence data sit together (we create a new sub-DataFrame with a subset of the fields and add those points to the map).
```python
def identify_existance(input_analyses, identifier, term):
    data = []
    with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
        for idx, mgya in input_analyses.iterrows():
            print(f"processing {mgya.id}")
            # Fetch the chosen annotation type (e.g. GO terms) for this analysis
            analysis_identifier = map(lambda r: r.json, mgnify.iterate(f'analyses/{mgya.id}/{identifier}'))
            analysis_identifier = pd.json_normalize(analysis_identifier)
            # Blue if the term was found in this analysis, red if not
            data.append("#0000FF" if term in list(analysis_identifier.id) else "#FF0000")
    presented = sum([1 for i in data if i == "#0000FF"])
    print(f"Presented {presented} of {identifier} {term}")
    input_analyses.insert(2, identifier, data, True)
    return input_analyses


def show_on_map(input_analyses, studies_samples, identifier):
    # Join analyses with samples to bring in the geolocation columns
    df = input_analyses.join(studies_samples.set_index('sample_id'), on='relationships.sample.data.id')
    df2 = df[[identifier, 'lon', 'lat', 'study', 'attributes.accession',
              'relationships.study.data.id', 'relationships.sample.data.id',
              'relationships.assembly.data.id']].copy()
    df2 = df2.set_index("study")
    df2 = df2.rename(columns={
        "attributes.accession": "analysis_ID",
        'relationships.study.data.id': "study_ID",
        'relationships.sample.data.id': "sample_ID",
        'relationships.assembly.data.id': "assembly_ID",
    })
    m = leafmap.Map(center=(0, 0), zoom=2)
    m.add_points_from_xy(
        df2, x='lon', y='lat',
        popup=["study_ID", "sample_ID", "assembly_ID", "analysis_ID", identifier],
        color_column=identifier,
        add_legend=False,
    )
    return m
```
GO term
This example is written for the GO term for biotin transport: GO:0015878.
Other GO identifiers are available on the MGnify API.
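Putting the two helpers together for this term could look like the sketch below, where `go-terms` is the analysis sub-endpoint queried inside `identify_existance`, and `analyses` and `studies_samples` are the dataframes built earlier:

```python
# Flag each analysis by whether GO:0015878 appears in its GO-term annotations,
# then plot the blue/red points on the world map
analyses = identify_existance(analyses, 'go-terms', 'GO:0015878')
show_on_map(analyses, studies_samples, 'go-terms')
```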
MGnify has additional analysis of BGCs provided by SanntiS. These annotations are saved as RO-Crate objects and linked to assembly records.
The following example counts the proteins that are truncated from the beginning (partial value "10"), whose nearest MiBIG class is Polyketide, and whose Dice distance is at least 0.65. We will use gffutils to parse the GFF files.
Define a function to find the GFF file in a zipped archive by URL:
```python
import requests
from zipfile import ZipFile
from io import BytesIO

def find_gff_file(link):
    # Read archive and find GFF file
    response = requests.get(link)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Open the zip file from the content of the response
        with ZipFile(BytesIO(response.content)) as zip_file:
            # List all files in the zip archive
            file_list = zip_file.namelist()
            # Filter files with .gff extension
            gff_files = [file_name for file_name in file_list if file_name.endswith(".gff")]
            print(f"Found {gff_files}")
            # Read the first .gff file (you can modify this to read a specific file)
            if gff_files:
                first_gff_file = gff_files[0]
                gff_content = zip_file.open(first_gff_file).read()
                return gff_content
            else:
                print("No .gff files found in the zip archive.")
                return None
    else:
        print("Failed to fetch the zip file.")
        return None
```
Define a function to get counts from the GFF:
```python
import tempfile
import gffutils

def get_count(nearest_MiBIG_class, nearest_MiBIG_diceDistance, partial_value, gff_content):
    if gff_content:
        try:
            # Write the GFF content to a named temporary file so gffutils can read it
            with tempfile.NamedTemporaryFile(delete=False) as temp_gff_file:
                temp_gff_file.write(gff_content)
                temp_gff_file_path = temp_gff_file.name
            # Create a GFF database using gffutils
            db = gffutils.create_db(
                temp_gff_file_path,
                dbfn=':memory:',  # Use an in-memory database
                force=True,       # Overwrite if the database already exists
                keep_order=True,  # Preserve feature order
            )
            count = 0
            for feature in db.all_features():
                if feature["nearest_MiBIG_class"][0] == nearest_MiBIG_class and \
                        float(feature["nearest_MiBIG_diceDistance"][0]) >= nearest_MiBIG_diceDistance and \
                        feature["partial"][0] == partial_value:
                    count += 1
            print(f"Count is {count}")
            return count
        except Exception:
            print('Error in GFF DB')
            return 0
    else:
        return 0
```
Process data:
```python
nearest_MiBIG_class = "Polyketide"
nearest_MiBIG_diceDistance = 0.65
partial_value = "10"

counts = []
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, mgya in analyses.iterrows():
        # get ERZ assembly accession
        assembly = mgya['relationships.assembly.data.id']
        # Look for the SanntiS RO-Crate among the assembly's extra annotations
        annotations_for_assembly = mgnify.iterate(f'assemblies/{assembly}/extra-annotations')
        sanntis_annotation = None
        for item in annotations_for_assembly:
            if 'sanntis' in item.id:
                sanntis_annotation = item.links.self
                break
        if not sanntis_annotation:
            print('Sanntis annotation was not found')
            continue
        print(f"processing {mgya.id} {assembly}")
        gff_content = find_gff_file(sanntis_annotation)
        counts.append(get_count(nearest_MiBIG_class, nearest_MiBIG_diceDistance, partial_value, gff_content))
```
```
processing MGYA00593142 ERZ2945023
Found ['ERZ2945023_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589840 ERZ2945090
Found ['ERZ2945090_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589562 ERZ2945101
Found ['ERZ2945101_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589561 ERZ2944669
Found ['ERZ2944669_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589560 ERZ2944798
Found ['ERZ2944798_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589559 ERZ2944724
Found ['ERZ2944724_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589558 ERZ2944859
Found ['ERZ2944859_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589557 ERZ2944879
Found ['ERZ2944879_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589556 ERZ2945017
Found ['ERZ2945017_FASTA.gb.sanntis.full.gff']
Error in GFF DB
processing MGYA00589555 ERZ2944878
Found ['ERZ2944878_FASTA.gb.sanntis.full.gff']
Error in GFF DB
```
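Once the loop finishes, `counts` lines up row-for-row with `analyses`, assuming a SanntiS annotation was found for every analysis (otherwise `counts` will be shorter than the table). One option is then to attach it as a column for further inspection; the `polyketide_bgc_count` column name here is our own:

```python
# Hypothetical follow-up: store the per-analysis counts alongside the analyses
analyses.insert(2, 'polyketide_bgc_count', counts, True)
analyses[['attributes.accession', 'relationships.assembly.data.id', 'polyketide_bgc_count']].head()
```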